
[Fix][RayJob SidecarMode] Prevent premature job termination during transient head node spikes #4399

Open

justinyeh1995 wants to merge 9 commits into ray-project:master from justinyeh1995:fix/4285-rayjob-sidecarmode-terminal-condition

Conversation

justinyeh1995 (Contributor) commented Jan 15, 2026

Why are these changes needed?

The submitter sidecar may exit during transient head-node CPU spikes, which currently causes the operator to mark the RayJob as failed even though the Ray job itself is still running.

The fix has two layers:

| Level | What it does | K8s version | KubeRay feature gate |
| --- | --- | --- | --- |
| 1 | Consults the Ray dashboard for the job's actual status before marking the RayJob as failed | All | Always on |
| 2 | Enables per-container restart rules for the submitter container so non-zero exits restart the container and re-attach the log | 1.34+ (with ContainerRestartRules enabled on the cluster) | SidecarSubmitterRestart |

If SidecarSubmitterRestart is enabled on a cluster running K8s < 1.34, the operator fails fast at startup.
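For orientation (not the PR's exact code), a minimal sketch of both layers in Go. The Level 2 rule uses the K8s 1.34 alpha ContainerRestartRules types that appear in the diff; the Level 1 check is written against a hypothetical `dashboardClient` stand-in rather than KubeRay's real dashboard client, and the function names are illustrative:

```go
package sidecar

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// Level 2 sketch: a rule that restarts the submitter container on any non-zero
// exit code, so it can re-attach to the job logs instead of failing the RayJob.
// The rule would go on the submitter container's restartPolicyRules field
// (K8s 1.34+ alpha ContainerRestartRules).
func submitterRestartRules() []corev1.ContainerRestartRule {
	return []corev1.ContainerRestartRule{{
		Action: corev1.ContainerRestartRuleActionRestart,
		ExitCodes: &corev1.ContainerRestartRuleOnExitCodes{
			Operator: corev1.ContainerRestartRuleOnExitCodesOpNotIn,
			Values:   []int32{0}, // any non-zero exit triggers a restart
		},
	}}
}

// Level 1 sketch: before treating a dead submitter as terminal, ask the Ray
// dashboard what the job is actually doing. dashboardClient is a stand-in
// interface, not KubeRay's actual dashboard HTTP client.
type dashboardClient interface {
	GetJobStatus(ctx context.Context, jobID string) (string, error)
}

func shouldMarkFailed(ctx context.Context, dc dashboardClient, jobID string) bool {
	status, err := dc.GetJobStatus(ctx, jobID)
	if err != nil {
		// Dashboard unreachable (e.g. a head-node CPU spike): do not fail the
		// RayJob yet; let the next reconcile retry.
		return false
	}
	// Only terminal Ray job states justify marking the RayJob as failed.
	return status == "FAILED" || status == "STOPPED"
}
```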

Related issue number

Closes #4285

Testing

Level 1: kind v1.26 (less than 1.34) / no feature gate enabled

  1. Create a kind cluster with node image v1.26 and follow the guide to build the operator image and load it into the kind cluster:
kind create cluster --name test-cluster --image kindest/node:v1.26.0

cd ray-operator
IMG=kuberay/operator:nightly make docker-build

kind load docker-image kuberay/operator:nightly --name test-cluster
  2. Apply the following manifest, a slightly modified version of ray-job.sidecar-mode.yaml:
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sidecar-mode
spec:
  # In SidecarMode, the KubeRay operator injects a container into the Ray head Pod to submit the Ray job and tail logs.
  # This will avoid inter-Pod communication, which may cause network issues. For example, some users face WebSocket hangs.
  # For more details, see https://github.com/ray-project/kuberay/issues/3928#issuecomment-3187164736.
  submissionMode: "SidecarMode"
  entrypoint: python /home/ray/samples/sample_code.py
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"

  rayClusterSpec:
    rayVersion: '2.52.0'
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.52.0
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            resources:
              limits:
                cpu: "1"
                memory: "5Gi"
              requests:
                cpu: "1"
                memory: "2Gi"
            volumeMounts:
            - mountPath: /home/ray/samples
              name: code-sample
          volumes:
          - name: code-sample
            configMap:
              name: ray-job-code-sample
              items:
              - key: sample_code.py
                path: sample_code.py
    workerGroupSpecs:
    - replicas: 1
      minReplicas: 1
      maxReplicas: 5
      groupName: small-group
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.52.0
            resources:
              limits:
                cpu: "1"
                memory: "1Gi"
              requests:
                cpu: "1"
                memory: "1Gi"

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests
    import time

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            assert self.name == "test_counter"
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

    # Verify that the correct runtime env was used for the job.
    assert requests.__version__ == "2.26.0"
    
    # keep job alive long enough to kill submitter mid-run
    print("Entering long-running phase (5 minutes)...")
    for i in range(300):
        ray.get(counter.inc.remote())
        if i % 10 == 0:
            print(f"tick={i}, {ray.get(counter.get_counter.remote())}")
        time.sleep(1)

    print("Done.")
  3. Watch the RayJob status until it becomes running:
kubectl get rayjob rayjob-sidecar-mode -w
  4. Disrupt the sidecar container:
CLUSTER=$(kubectl get rayjob rayjob-sidecar-mode -o jsonpath='{.status.rayClusterName}')
HEAD_POD=$(kubectl get pods -l ray.io/cluster=$CLUSTER,ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')

# we cannot kill pid 1 within the container 
# e.g. kubectl exec $HEAD_POD -c ray-job-submitter -- sh -c 'pkill -f "ray job log"'
# so we stop the container instead

CONTAINER_ID=$(kubectl get pod $HEAD_POD -o jsonpath='{.status.containerStatuses[?(@.name=="ray-job-submitter")].containerID}' | sed 's|containerd://||')
docker exec -it test-cluster-control-plane crictl stop $CONTAINER_ID
  5. Verify the RayJob is still running:
kubectl get rayjob rayjob-sidecar-mode -o jsonpath='{.status.jobDeploymentStatus}'
(Screen recording attached: Screen.Recording.2026-02-04.at.9.16.01.PM.-.Compressed.with.FlexClip.mp4)

kind v1.26 (less than 1.34) / feature gate enabled

  1. Create a kind v1.26 cluster and load the operator image into it:
kind create cluster --name test-cluster --image kindest/node:v1.26.0
kind load docker-image kuberay/operator:nightly --name test-cluster
  2. Install the operator with the feature gate enabled:
helm upgrade --install kuberay-operator ../helm-chart/kuberay-operator \
  --set image.repository=kuberay/operator \
  --set image.tag=nightly \
  --set featureGates\[0\].name=SidecarSubmitterRestart \
  --set featureGates\[0\].enabled=true
k get pods
# It is expected that the operator exits with CrashLoopBackOff

k logs kuberay-operator-58f4998f5d-2jc6k
# and the logs mention SidecarSubmitterRestart feature gate requires K8s 1.34+
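For context (again illustrative rather than the PR's exact code), the fail-fast behavior seen above could be implemented as a discovery-client version check at operator startup, along these lines:

```go
package sidecar

import (
	"fmt"
	"strconv"
	"strings"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// validateSidecarSubmitterRestart returns an error (so the operator can exit
// immediately) when the SidecarSubmitterRestart feature gate is enabled but
// the API server is older than 1.34.
func validateSidecarSubmitterRestart(cfg *rest.Config, gateEnabled bool) error {
	if !gateEnabled {
		return nil
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return err
	}
	v, err := dc.ServerVersion()
	if err != nil {
		return err
	}
	// Minor can carry a "+" suffix on some providers (e.g. "34+").
	major, _ := strconv.Atoi(strings.TrimSuffix(v.Major, "+"))
	minor, _ := strconv.Atoi(strings.TrimSuffix(v.Minor, "+"))
	if major > 1 || (major == 1 && minor >= 34) {
		return nil
	}
	return fmt.Errorf("SidecarSubmitterRestart feature gate requires K8s 1.34+, server reports %s.%s", v.Major, v.Minor)
}
```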

kind v1.34+ / feature gate enabled

  1. Create a cluster with a v1.34+ node image and ContainerRestartRules enabled (v1.35 enables it by default):
kind create cluster --name test-cluster --image kindest/node:v1.35.0
kind load docker-image kuberay/operator:nightly --name test-cluster
  2. Enable the feature gate for the KubeRay operator:
helm upgrade --install kuberay-operator ../helm-chart/kuberay-operator \
  --set image.repository=kuberay/operator \
  --set image.tag=nightly \
  --set featureGates\[0\].name=SidecarSubmitterRestart \
  --set featureGates\[0\].enabled=true

  3. Apply the same manifest pasted above.

  4. Record the job ID:

JOB_ID=$(kubectl get rayjob rayjob-sidecar-mode -o jsonpath='{.status.jobId}')
  5. Disrupt the sidecar container:
CLUSTER=$(kubectl get rayjob rayjob-sidecar-mode -o jsonpath='{.status.rayClusterName}')
HEAD_POD=$(kubectl get pods -l ray.io/cluster=$CLUSTER,ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')

CONTAINER_ID=$(kubectl get pod $HEAD_POD -o jsonpath='{.status.containerStatuses[?(@.name=="ray-job-submitter")].containerID}' | sed 's|containerd://||')
docker exec -it test-cluster-control-plane crictl stop $CONTAINER_ID
  6. Verify the RayJob does not fail:
kubectl get rayjob rayjob-sidecar-mode -o jsonpath='{.status.jobDeploymentStatus}'
  7. Verify the submitter container actually restarted:
kubectl get pod $HEAD_POD -o jsonpath='{range .status.containerStatuses[*]}{.name}{" restartCount="}{.restartCount}{"\n"}{end}'

# The restartCount should increase by 1
  8. Verify the Ray job is still running with the same job ID:
kubectl exec $HEAD_POD -c ray-head -- ray job status --address=http://127.0.0.1:8265 "$JOB_ID"

Checks

  • Testing Strategy
    • Unit tests
    • Manual tests

@justinyeh1995 justinyeh1995 changed the title [WIP][Fix][RayJob SidecarMode] prevent premature job termination during transient head node spikes [WIP][Fix][RayJob SidecarMode] Prevent premature job termination during transient head node spikes Jan 15, 2026
Future-Outlier (Member) left a comment:


  1. We should make sure this is only used when the K8s version is >= 1.34.
  2. We should also have some mechanism to check that the user has enabled the alpha feature via the feature gate.

justinyeh1995 (Contributor, author) replied:

  1. We should make sure this is only used when the K8s version is >= 1.34.
  2. We should also have some mechanism to check that the user has enabled the alpha feature via the feature gate.

Appreciate the concrete suggestions.

I am wondering whether we should keep the current fix (checking the dashboard before applying the timeout) as a fallback for users on K8s < 1.34.

Or should we simply focus on solving it for K8s >= 1.34? My judgement is that the two can coexist.

justinyeh1995 (Contributor, author) commented:

After discussing offline with @Future-Outlier, this PR will instead focus on implementing submitter restarts for K8s 1.34+.

@justinyeh1995 justinyeh1995 changed the title [WIP][Fix][RayJob SidecarMode] Prevent premature job termination during transient head node spikes [Fix][RayJob SidecarMode] Prevent premature job termination during transient head node spikes= Jan 24, 2026
@justinyeh1995 justinyeh1995 changed the title [Fix][RayJob SidecarMode] Prevent premature job termination during transient head node spikes= [Fix][RayJob SidecarMode] Prevent premature job termination during transient head node spikes Jan 28, 2026
Comment on lines +629 to +630
Operator: corev1.ContainerRestartRuleOnExitCodesOpNotIn,
Values: []int32{0},
justinyeh1995 (Contributor, author) commented:


Is it ok if we restart on any non-zero exit code?

@justinyeh1995 justinyeh1995 marked this pull request as ready for review January 31, 2026 08:16
cursor (bot) left a comment:


Cursor Bugbot has reviewed your changes and found 1 potential issue.

@justinyeh1995 justinyeh1995 marked this pull request as draft February 4, 2026 10:39
@justinyeh1995 justinyeh1995 marked this pull request as ready for review February 4, 2026 10:40
Copilot AI mentioned this pull request Feb 5, 2026
@CheyuWu CheyuWu self-requested a review February 6, 2026 17:32
justinyeh1995 (Contributor, author) commented:

cc @seanlaii @troychiu to review if you get a chance, thanks a lot!


Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Submission mode SidecarMode causes job termination when head node CPU spikes

2 participants